FAST FOURIER TRANSFORM GENERATOR 


By 

A. S. KANADE 




DEPARTMENT OF ELECTRICAL ENGINEERING


INDIAN INSTITUTE OF TECHNOLOGY KANPUR 


JULY 1975 



FAST FOURIER TRANSFORM GENERATOR 


A Thesis Submitted 

in Partial Fulfilment of the Requirements 
for the Degree of 

MASTER OF TECHNOLOGY 


By 

A. S. KANADE 


to the 


DEPARTMENT OF ELECTRICAL ENGINEERING 

INDIAN INSTITUTE OF TECHNOLOGY KANPUR 

JULY 1975 





CERTIFICATE


This is to certify that the project work entitled
'Fast Fourier Transform Generator' has been carried out
entirely under my supervision and has not been submitted
elsewhere for a degree.




R. N. Biswas 
Assistant Professor 
Department of Electrical Engineering 
Indian Institute of Technology 
Kanpur



ACKNOWLEDGEMENT


I wish to convey my deep sense of gratitude to
Dr. R.N. Biswas for his continuous guidance and encouragement
throughout the course of this work. I am particularly
indebted to him for suggesting some novel ideas without which
it would not have been possible to continue with this work.

My sincere thanks are due to the Staff of the 
Electrical Engineering Department who helped towards the 
completion of this project. I also wish to thank 
Mr. K.F. Tewari for his invaluable assistance in typing 
this report. 


A.S. Kanade 





CONTENTS

INTRODUCTION

Chapter 1     MATHEMATICAL BACKGROUND OF FFT

Chapter 2     SYSTEM DESIGN CONSIDERATIONS

Chapter 3     ARITHMETIC UNIT

Chapter 4     MAIN CONTROL UNIT

REFERENCES

Appendix A1   GENERATION OF 'R' AND 'I' COMMANDS

Appendix A2   LIST OF CONTROL COMMANDS

Appendix A3   FABRICATION





LIST OF FIGURES

Number    Description

1.1       (a) Signal flow graph for 8 point FFT
          (b) Basic computation
2.1       System block diagram
3.1       AU block diagram
3.2       AU control flow diagram
4.1       Address counter
A3.1a     Program counter
A3.1b     Cycle clock generator
A3.2      Gating controls
A3.3      Clock gating controls
A3.4      ROM decoder
A3.5      Adder/Subtractor 'A'
A3.6      Adder/Subtractor 'B'
A3.7      Adder/Subtractor 'C'
A3.8      Gating for data selection
A3.9      Logic for sign bit generation
A3.10     Trigonometric coefficient register
A3.11     Data register
A3.12     ROM buffer register
A3.13     Serial-parallel multiplier
A3.14     One of four selector



ABSTRACT 


An Arithmetic Unit (AU) capable of performing the
butterfly computations occurring in an FFT algorithm
has been designed and fabricated. This AU is a separate
self-contained module and can be used to compute a radix '2'
FFT of up to 2K samples. The conventional ROM
containing trigonometric coefficients has been replaced
by a much smaller ROM containing ten incremental
coefficients. With the help of these, the AU generates
the required coefficients. The AU must be interconnected
with the main control unit and a random access
memory module to form a complete Fast Fourier Transform
Generator (FFTG) system.



INTRODUCTION


Major advances made by engineers in producing faster
data processing systems usually stem from the development of
electronic devices. But this is not the case with signal
and data analysis applications involving Fourier
transformation. Here the advances have been triggered not by
electronics but by mathematics.

Fourier transformation is a useful tool in
extracting the information contained in many kinds of
waveforms such as seismic waves, electroencephalograms
and data signals telemetered from deep space. Many
approaches have been taken to find the energy content of
frequencies. One familiar and inexpensive method calls
for a bank of filters. But this is an analog approach
which is inherently limited in resolution and flexibility.
These disadvantages are overcome by taking the Discrete Fourier
Transform (DFT) of a record of samples of a continuous
waveform. The integral is replaced by a finite summation

$$F(f) = \mathrm{DFT}(x(n)) = \sum_{n=0}^{N-1} x(n) \exp\left[-j \frac{2\pi}{N} f n\right]$$

where x(0), x(1), ..., x(N-1) are the samples of the
continuous waveform x(t). The above expression can be
rewritten as

$$F(k) = \sum_{n=0}^{N-1} x(n) W^{nk}, \qquad k = 0, 1, \ldots, N-1$$

and $W = \exp[-2\pi j/N]$.

In the straightforward brute force technique,
all the Fourier coefficients have to be calculated
separately. In the case of a real function, only N/2
coefficients have to be calculated since those corresponding
to more than half the sampling frequency are
complex conjugates of those below. Thus a total of
$2N \cdot N/2 = N^2$ real multiplications and additions are
required to compute the coefficients. This effort
becomes phenomenal for any useful number of samples.
In addition, 2N memory locations are required to store
the data and the coefficients.
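
As an illustration only, the brute force summation can be sketched in Python as follows. The function names are hypothetical and floating point complex arithmetic is assumed; the operation count simply restates the 2N.N/2 figure quoted above for a real record.

```python
import cmath


def direct_dft(samples):
    """Evaluate F(k) = sum over n of x(n) * W**(n*k), with W = exp(-2*pi*j/N)."""
    n_points = len(samples)
    w = cmath.exp(-2j * cmath.pi / n_points)
    coefficients = []
    for k in range(n_points):
        total = 0j
        for n in range(n_points):
            # One complex multiplication per term of the summation.
            total += samples[n] * w ** (n * k)
        coefficients.append(total)
    return coefficients


def brute_force_real_multiplications(n_points):
    """Count quoted in the text for a real record: 2N * N/2 = N**2."""
    return 2 * n_points * (n_points // 2)
```

For a 4-K record the count is 4096 x 4096, or about 16.8 million real multiplications, which is the effort that the algorithm of the next paragraph avoids.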

In 1965, J.W. Cooley of IBM and J.W. Tukey of Princeton
University and Bell Laboratories developed an algorithm which
achieves spectacular computational savings over the brute force
technique. Called the 'Fast Fourier Transform' (FFT),
this algorithm has made possible the use of Fourier
transformation techniques in many fields where it was
previously too expensive because of the large number
of computations involved. The algorithm is based on
the fact that if the number of samples is divisible
by an integer 'r' (1, 2, 3, ...), then the Fourier
transformation problem can be simplified by obtaining the
transforms of 'r' groups of N/r points and combining the
results. The total number of calculations is minimised
by repeating this factoring process in an appropriate
way. The computational saving is of the order of
$N/\log_2 N$.

The DFT is a very useful analytical tool which can
be used for various data processing applications. It can
be used by analytical chemists to measure NMR spectra.
Structural designers can use it to determine the transfer
function and vibration modes of a structure. Brain
researchers can use it to measure spectra of brain waves.
It can also be used by behavioural scientists,
psychophysicists, biomedical researchers, process control
system designers, oceanographers and geophysicists. Most
of these diversified uses have become possible only
because of the FFT.

The details of the Cooley-Tukey algorithm are
described in the first chapter. The second chapter contains
various aspects of system design, design alternatives and
the choice of final design. The arithmetic unit, its
hardware and its operation are described in the third chapter.
A tentative design for the main control unit is presented
in the fourth chapter.

Chapter 1

MATHEMATICAL BACKGROUND OF FFT

As described previously, the Cooley-Tukey algorithm
is based on the property of factoring the record of samples,
taking the DFT of the factors and then combining the
results. If the number of samples $N = r^m$, where 'r' and 'm'
are integers, then the factoring can be continued till only
'r' point transforms have to be calculated in the first
step. The algorithm is then called radix 'r' FFT. From the
hardware realization point of view, only the radix '2' FFT
is important, and it is described in the following section.

1.1 Radix '2' FFT

If the number of samples $N = 2^m$, then the
factoring can be continued till there are N/2 groups of
2 points. These two-point transforms can be calculated
easily. The two-point transforms are then combined to
produce four-point transforms, four-point transforms are
combined to produce eight-point transforms and so on.
If this process is continued, then after $\log_2 N$ such steps,
the complete 'N'-point transform is obtained.

Thus it is observed that for an 'N'-point
transform, there are $\log_2 N$ stages of combination. These
are called iterations. In each iteration there are
N/2 basic computations. The sequence of operations is
illustrated by the signal flow graph of Figure 1.1.
The entire transform is obtained by repeating a basic
computational pattern (colloquially called the butterfly).
The flow graph is constructed for an eight-point transform;
however the symmetry of the graph suggests how it can be
expanded for any number of points.

1.2 The Basic Computation

The basic computation of the FFT is the generation
of the two-point transform. It consists of accessing
two complex numbers from two memory locations, multiplying
the second number by an appropriate trigonometric weight
$W^k$, adding it to the first number and storing back in the
first location; again multiplying the second number by a
weight $W^{k+N/2}$, adding it to the first number and
storing back in the second location. 'W' is defined to
be $\exp(-2\pi j/N)$.

We observe that $W^{k+N/2} = -W^k$, since
$W^{N/2} = \exp(-\pi j) = -1.0$. Hence we need to perform only
one complex multiplication instead of two. Then the
basic computation is modified as follows.

Two complex numbers are accessed from two memory
locations. The second number is multiplied by $W^k$. The
result is added to the first number and stored in the first
location, subtracted from the first number and stored in
the second location.

Thus each computation involves one complex
multiplication or four real multiplications. The total
number of real multiplications in the transform becomes
$2N \log_2 N$ instead of $N^2$ in the brute force technique.
The saving in computational effort is apparent. For a
4-K point transform, the saving is of the order of 170.
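
A minimal Python sketch of the procedure described above is given here as an illustration only; it is not the hardware realization. The function names are hypothetical, floating point complex arithmetic stands in for the AU, and the choice of weight for each group of butterflies (the reversed bit integer of the group number) is an assumption that follows the usual natural-input, scrambled-output form of the algorithm.

```python
import cmath


def bit_reverse(value, bits):
    """Return `value` with its lowest `bits` bits written in reverse order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def butterfly(first, second, weight):
    """Basic computation: one complex multiplication, one add, one subtract."""
    product = weight * second
    return first + product, first - product


def fft_scrambled_output(samples):
    """Radix-2 transform of naturally ordered samples.

    Coefficient F(k) is left at location bit_reverse(k, log2(N)).
    """
    data = list(samples)
    n_points = len(data)
    stages = n_points.bit_length() - 1
    if n_points < 2 or n_points != 1 << stages:
        raise ValueError("number of samples must be a power of two")
    span = n_points // 2  # distance between the two members of a pair
    for stage in range(stages):
        for group in range(1 << stage):
            # Assumed weight rule: exponent is the reversed bit integer
            # of the group number, taken over log2(N) - 1 bits.
            exponent = bit_reverse(group, stages - 1)
            weight = cmath.exp(-2j * cmath.pi * exponent / n_points)
            base = group * 2 * span
            for i in range(base, base + span):
                data[i], data[i + span] = butterfly(data[i], data[i + span], weight)
        span //= 2
    return data


def fft_real_multiplications(n_points):
    """N/2 butterflies per iteration, log2(N) iterations, 4 real products each."""
    return 2 * n_points * (n_points.bit_length() - 1)
```

With N = 4096 the ratio of the brute force count to this count is 4096/24, or about 170, in agreement with the figure quoted above.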

1.3 The Trigonometric Weights

For an 'N' point transform, 'N' complex weights
are required. These are nothing but the 'N' roots of unity.
As explained in the previous section, only N/2 are
independent; the remaining N/2 are just the negatives of
these independent weights. Furthermore only N/4
independent numerical values make up these N/2 weights.
This is because $\sin(90^\circ - \theta) = \cos\theta$.

For the computation of the transform these
weights are required in reversed bit order, and not in
their natural order. Furthermore, not all weights are
required in all the iterations. Only one weight is
required in the first iteration, two are required in
the second iteration, four in the third and so on.
All N/2 weights are required in the last iteration.

The reversed bit ordering of weights implies that
a weight $W^k$ is required not in its natural order, i.e.
at the k-th place, but at the k'-th place, where k'
is the reversed bit integer of k. A reversed bit
integer is obtained by reversing the order of bits
in the binary form. The reversed bit ordering is an
essential feature of the radix '2' FFT algorithm and is
not confined only to the ordering of weights, as we
shall observe in the next article.

1.4 Reversed Bit Ordering of Data and Results 

Another property peculiar to the FFT is that, for
a naturally ordered sequence of samples, the Fourier 
coefficients do not occur in the natural sequence but 
in the bit reversed sequence. To access a particular 
coefficient, we must find the reversed bit integer of the 
address and access the coefficient from that location. 

For example, in the eight-point FFT, coefficient F(3) 
occurs in location 6, because 6(110) is the reversed 
bit integer of 3(011). 
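
The reversed bit integer and the resulting access rule can be sketched in Python as follows. This is an illustrative sketch with hypothetical function names, not a description of the hardware.

```python
def reversed_bit_integer(value, bits):
    """Reverse the order of the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def unscramble(scrambled):
    """Return the coefficients in natural order.

    F(k) is read from the location whose address is the reversed bit
    integer of k. The length is assumed to be a power of two.
    """
    n_points = len(scrambled)
    bits = n_points.bit_length() - 1
    return [scrambled[reversed_bit_integer(k, bits)] for k in range(n_points)]
```

For the eight-point case, `reversed_bit_integer(3, 3)` gives 6, which is the location quoted above for F(3).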

The above process is called 'decimation in 
frequency' since the Fourier coefficients occur in the 
scrambled form. On the other hand, if Fourier 
coefficients are to be obtained in the natural sequence, 
then the data or the record of samples must be scrambled 
in the reversed bit order. This is called 'decimation 
in time'. 

1.5 Choice of 'N'

Since the time function to be transformed must be
sampled at discrete intervals, say $\Delta t$, only a finite
number 'N' of such samples can be taken and stored. The
record length T is then $T = N \cdot \Delta t$. The effect of finite
$\Delta t$ is to limit the maximum frequency that may be sampled
without aliasing error to $f_{max} = 1/(2\Delta t)$. Any components
above this Nyquist frequency are folded back onto
frequencies below $f_{max}$. In practical measurement
situations, this aliasing presents little or no
difficulty since $f_{max}$ can be chosen to include all
significant components of the input signal, or a low-pass
filter may be used before the sampler to eliminate
any strong components above $f_{max}$. The spectral
resolution is given by $\Delta f = 1/T$, i.e. to get fine
resolution, the record length must be large enough.

In general, $f_{max}$ is limited to a few kHz in
many practical systems. For example, NMR spectrometry
requires a range of 2000 Hz, with a resolution of 1 Hz.
This implies that the sampling rate must be at least 4000 Hz
and the record length 1 second, so that N = 4000.
N is chosen to be 4096 or 4-K in this case for the
FFT to be applicable.
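
The arithmetic of this example can be restated as a short Python sketch. The function name and the rounding up to the next power of two are illustrative assumptions; only the relations $f_{max} = 1/(2\Delta t)$ and $\Delta f = 1/T$ are taken from the text.

```python
import math


def choose_record(f_max_hz, resolution_hz):
    """Return (sampling rate, record length in s, samples needed, radix-2 N)."""
    sampling_rate = 2 * f_max_hz          # from f_max = 1 / (2 * delta_t)
    record_length = 1 / resolution_hz     # from delta_f = 1 / T
    n_needed = math.ceil(sampling_rate * record_length)
    n_fft = 1
    while n_fft < n_needed:               # smallest power of two >= n_needed
        n_fft *= 2
    return sampling_rate, record_length, n_needed, n_fft
```

For the NMR example, `choose_record(2000, 1)` gives a sampling rate of 4000 Hz, a record length of 1 second, 4000 samples and a transform size of 4096.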

1.6 Address Generation

Each butterfly computation requires a pair of
addresses. These addresses follow a repetitive logical
pattern. The sequence of addresses is different in
each iteration. We can determine the logic for an 8-point
FFT and extrapolate the results for an 'N' point FFT.
In the example, the address pairs in the
first iteration are (000,100), (001,101), (010,110)
and (011,111). It is observed that the second address
in a pair is obtained by inverting the first bit of
the first address. The address pairs are (000,010),
(001,011), (100,110) and (101,111) in the second
iteration. Here the second address is obtained by
inverting the second bit of the first address. In
the third and last iteration, address pairs are
(000,001), (010,011), (100,101) and (110,111). Here
the third bit of the first address is inverted to
obtain the second address. The logical pattern suggests
that in the k-th iteration, the second address in a
pair can be obtained by inverting the k-th bit of the
first address.
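
The pairing rule above can be sketched in Python as follows. The function name is hypothetical, bits are counted from the most significant end as in the example, and taking as first addresses exactly those whose k-th bit is zero is an assumption read off the listed pairs.

```python
def address_pairs(n_points, iteration):
    """Address pairs for the given iteration (1-based) of an N point transform.

    The second address is the first address with its k-th bit inverted,
    where k is the iteration number counted from the most significant bit.
    """
    bits = n_points.bit_length() - 1
    mask = 1 << (bits - iteration)
    return [(first, first | mask) for first in range(n_points) if not first & mask]
```

For N = 8 this reproduces the three lists of pairs given above, for example (0,4), (1,5), (2,6), (3,7) in the first iteration.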

Again observing the address pairs of the example,
we find that the first bit of the first address is
always zero in the first iteration, the second is zero in
the second iteration and the third is zero in the third
iteration. Thus the first addresses are (000, 001, 010
and 011) in the first iteration, (000, 001, 100 and 101) in
the second iteration and (000, 010, 100 and 110) in the
third iteration. We observe that the sequence is
strictly natural binary if the permanent zero bit is
suppressed.

Chapter 2

SYSTEM DESIGN CONSIDERATIONS

Since the Fast Fourier Transform Generator (FFTG)
is a special purpose hardwired calculator, its organization
is essentially the same as that of a computer. It must
have a memory for storing data and results, a processing
unit which carries out the calculations and a main
control unit which controls the processor and the memory.
There are two main alternatives for the processing unit
which are given below.

2.1 Sequential Versus Cascade Organization 

The task of the processing unit is to carry out 
the butterfly computations. In the sequential organi- 
zation, the processing unit has one Arithmetic Unit (AU) 
which is hardwired to carry out a basic computation. It 
carries them out one by one and is the slowest and
cheapest of all organizations. In cascade organization,
the processing unit has $\log_2 N$ arithmetic units which
start computing all the iterations more or less 
simultaneously. This organization is costly and is 
justified only if a real time analysis must be done or 
a large amount of data must be processed at a very high 
speed. The sequential processor is much cheaper and ideal 
for off line requirements. Since our aim is not processing 
of real time data, the sequential organization is chosen. 





2.2 Record Length of Samples

As calculated in Section 1.5, a 4000 point FFT is
sufficient in most practical situations. To meet
the radix '2' requirement, we choose the maximum N to be
4096 or 4-K points. It is also desirable to be able to
generate the FFT of a smaller number of points without changing
the hardware. This brings us to the concept of modularity,
which is dealt with in more detail in Chapter 4.

2.3 Word Length

There are two main forms in which the FFTG can
output the results. One is to plot the Fourier coefficients
on an x-y recorder or display them on an oscilloscope.
The other is to print them out in decimal form on a typewriter.
The second form is necessary where results must be
accurately known for further analysis. The direct
display or plotting does not require more than 8 bits/word
accuracy. For the second alternative, the number of
bits must be more than 12/word to take care of the
computational errors. Hence the tentative word length
must be 12-16 bits/word. Greater word length costs more
in terms of hardware. For very good accuracy, it is better
to use a general purpose computer system.

2.4 Alternatives for Weight Generation

As mentioned earlier, an 'N' point FFT requires
N/2 trigonometric weights. These constitute N/4
independent numerical values. The first alternative is
to have a Read Only Memory which stores these N/4 independent
values. For a 4-K point FFT, we require a ROM with
the capacity of storing 1-K words, 12-16 bits/word. Such
ROMs are not available as off the shelf items and
getting them custom made is costly, unless required in
large quantities.

The second alternative is to generate these
trigonometric values in binary form as a multiple output
switching function whose input is the address of the
weight. This synthesis is almost impossible without the
availability of efficient computer programs for
minimization of truth tables. Even with their help, the
necessary number of gates for realization may turn out to
be very large.

The third alternative is to compute the required
complex weight. This possibility is especially attractive
as the AU already has the capability of generating new
complex numbers by complex multiplications. In the next
section, we look into this alternative in more detail.

2.5 Complex Weight Generation

As pointed out in Section 1.3, the weights are
always required in reversed bit order. This fixed
ordering indicates that, as we know the next weight
required, it can be generated by multiplication of the
previous weight and an appropriate complex number. This
complex number is nothing but 'W' raised to the power of
the difference between the powers of the previous weight
and the new weight.

A computer program was run to arrange the weights
in reversed bit order and calculate the increments. It
turned out that 11 independent weight increments were required
for 4-K point FFT, 10 for 2-K point FFT and 9 for 1-K point
FFT. Of these, one is trivial, its value being
$\exp(-j\pi/2) = -j$. Multiplication by this just implies
interchanging the real and imaginary parts of the
previous weight and inverting the sign of the new imaginary
part. Thus the number of nontrivial weight increments
turns out to be $\log_2 N - 2$ for 'N' point FFT, which is 10
for 4-K point FFT and 9 for 2-K point FFT. This number
is small enough for the ROM to be realized using diode
matrices. All the other weights can be realized by
multiplying the previous weight and one of these, chosen
according to a fixed pattern. This pattern is described
in detail in Sections 3.1 and 4.5.
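
The counting argument above can be reproduced with a short Python sketch. This is not the program referred to in the text; the function names are hypothetical, exponents are taken modulo N, and the weights are assumed to be needed in the order given by the reversed bit integers of 0, 1, ..., N/2 - 1.

```python
import cmath


def bit_reverse(value, bits):
    """Reverse the order of the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def weight_exponents(n_points):
    """Exponents k of the weights W**k in the order they are required."""
    bits = n_points.bit_length() - 2      # log2(N) - 1 bits for N/2 weights
    return [bit_reverse(g, bits) for g in range(n_points // 2)]


def distinct_increments(n_points):
    """Distinct exponent differences (mod N) between successive weights."""
    exps = weight_exponents(n_points)
    return sorted({(b - a) % n_points for a, b in zip(exps, exps[1:])})


def generate_weights(n_points):
    """Build every weight from the previous one and a stored increment."""
    exps = weight_exponents(n_points)
    table = {d: cmath.exp(-2j * cmath.pi * d / n_points)
             for d in distinct_increments(n_points)}
    weight = 1 + 0j                       # first weight is W**0
    weights = [weight]
    for a, b in zip(exps, exps[1:]):
        weight *= table[(b - a) % n_points]
        weights.append(weight)
    return weights
```

Under these assumptions `distinct_increments` returns 11 values for N = 4096, 10 for N = 2048 and 9 for N = 1024, and the value N/4, which corresponds to the trivial increment -j, is always among them.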

2.6 System Description

Memory : For a 4-K point FFT, 4-K locations for
storing the real and 4-K for storing the imaginary parts of
the Fourier coefficients are necessary. Since the
memory module is the most costly of all, making it
dedicated for the use of the FFTG is uneconomical. Hence
it was decided to fabricate a general purpose random
access memory with single word read/write capability.
The memory is a 4-K word, 20 bits/word memory. Hence
it can be used for calculating a 2-K point FFT.

Main Control : Main control generates addresses for the
butterfly computations and sends other control commands
to the AU and memory. The logic for address generation has
already been dealt with in Section 1.6. The other
details are given in Chapter 4.

Arithmetic Unit : Design aspects of the AU are given in
detail in the next section.

It was decided to fabricate all three subsystems
separately and interconnect them later whenever desired.
Two of the subsystems, memory and main control, can find
use as parts of other systems as well. The system
block diagram is given in Figure 2.1.

2.7 Design Considerations for AU

The AU has to perform two complex multiplications,
two additions and two subtractions. In terms of basic
computations, this involves eight real multiplications,
four additions and four subtractions. These operations
can be performed in bit serial or bit parallel form.
Though bit parallel operations are very much faster,
they are much more costly in terms of increased hardware.
Hence the bit serial form of addition and subtraction was
chosen.

Single Multiplier Versus Four Multipliers

Each complex multiplication requires four real
multiplications. These can be done sequentially by one
multiplier or together by four multipliers. The single
multiplier scheme requires less processor hardware. On
the other hand, the AU control becomes more complex
because it has to control eight multiplications and
associated real and imaginary additions and subtractions
which take place sequentially.

The four multiplier scheme requires more processor
hardware but is faster than the single multiplier scheme.
Moreover, the AU control is much simpler because it can
control all the operations in parallel with the help
of the same control commands. It turns out that the total
hardware, including both processor and control, is
about 25 percent more in the case of the four multiplier
scheme. However, its execution time is about one third
of that obtainable with the single multiplier scheme,
which is a substantial speed gain for a small additional
amount of hardware. The four multiplier scheme was
chosen for the AU.

Chapter 3

ARITHMETIC UNIT

The Arithmetic Unit of the FFTG performs one
complex multiplication, two additions and two subtractions
involved in the butterfly computation. It also performs
an additional complex multiplication when required to
generate the new weight. The basic design decisions for
the AU were given in the last chapter. In this chapter
we examine the AU hardware in detail. The AU block
diagram is given in Figure 3.1.

The AU hardware consists of four multipliers,
one adder, one subtractor, two data registers, one ROM
buffer register, one weight register, two buffer registers,
six serial two's complementers, some flip-flops to store
the sign bits of data and intermediate results and some
additional gating and combinatorial logic. It also has
a Read Only Memory and decoder which are described below.

3.1 ROM for Weight Increments

As was mentioned in the previous chapter, there
are 10 independent nontrivial increments which must be
preserved in a ROM. These 10 increments are for a 4-K
point FFT, but the same increments hold good for any
other 'N' point FFT for 'N' less than 4-K. Each
increment is a complex number with a real and an
imaginary part. Each number has 14 magnitude bits and
one sign bit. The choice of 14 as the number of magnitude
bits is explained in the next section. Thus a ROM
having 20 x 15 organization is indicated. Each of the
ten lines selects two 15-bit words. To select one of
these increments, a counter decoder is used. To
generate the logic for this decoder, a computer program
was run to arrange the increments in the order they are
required. By looking at the time of occurrence of the
increments, a simple logical pattern emerges which
consists of detecting particular bit patterns of a
10 bit counter. There are ten such patterns for the
ten coefficients. Each is realized with the help of
multi-input NAND gates. Incrementing the counter itself
requires another logical pattern and is under the control
of the main control unit.
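
One pattern of this kind can be sketched in Python. This is an assumption offered for illustration and not the decoder logic of the thesis: if the weights are taken in reversed bit order of a counter, the increment needed to reach the next weight depends only on the number of trailing 1 bits of the present counter value. The function names are hypothetical.

```python
def bit_reverse(value, bits):
    """Reverse the order of the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def increment_index(counter):
    """Number of trailing 1 bits: ...0 gives 0, ...01 gives 1, ...011 gives 2."""
    index = 0
    while counter & 1:
        counter >>= 1
        index += 1
    return index


def exponent_step(counter, bits):
    """Change in weight exponent when the counter advances by one."""
    return bit_reverse(counter + 1, bits) - bit_reverse(counter, bits)
```

For a 10 bit counter the index takes the ten values 0 to 9 over the useful counter range, and every counter value with the same index calls for the same exponent step.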

3.2 Word Size for the AU

It is convenient to use 16 bit registers in the
AU. If one bit is to be left for overflow, which will
always result if data is normalized to 1, then this
leaves us with 15 bits. Of these, 14 bits are magnitude
bits and one bit is the sign bit.

The overflow results because of the repeated
additions and subtractions of intermediate results. This
also causes the Fourier coefficients to be computed
within a scale factor of 'N', which will require longer
registers. To scale down the coefficients, they must
be divided by 'N'. One simple way to do this is to
shift the result one bit to the right in every iteration.
This amounts to division by $2^{\log_2 N} = N$.

3.3 Interpretation of Numbers

The numbers occurring in the FFT can be interpreted
either as fractions or integers. If they are interpreted
as integers, the binary multipliers have to
preserve the full product, necessitating more hardware.
On the other hand, interpretation as fractions leaves
us free to reject the lower order bits of the product
and still retain the same interpretation. Hence for
all further discussions we assume the data to be
normalized to 1, with the result also being normalized to
1 after one right shift.
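
The fractional interpretation and the per-iteration right shift can be illustrated with a small Python sketch on integer codes. The constant and function names are hypothetical; 14 magnitude bits are assumed as in Section 3.2, and sign handling is left out.

```python
FRACTION_BITS = 14  # magnitude bits, as in Section 3.2


def to_fraction_code(x):
    """Magnitude code of a fraction 0 <= |x| < 1, truncated to 14 bits."""
    return int(abs(x) * (1 << FRACTION_BITS))


def multiply_fractions(a_code, b_code):
    """Product of two fractions, keeping only the higher order bits.

    The full product has 28 fraction bits; the lower 14 are rejected,
    so the result is again a 14 bit fraction.
    """
    return (a_code * b_code) >> FRACTION_BITS


def scale_iteration(value_code):
    """One right shift, i.e. division by 2, applied in every iteration."""
    return value_code >> 1
```

For example, the code of 0.5 multiplied by itself gives the code of 0.25, and log2(N) successive right shifts divide a result by N.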


3.4 The Complex Multiplier

This consists of four binary multipliers. Each
multiplier is identical and multiplies two 16 bit
numbers. The product is 16 bits long since the least
significant 16 bits are rejected. The multipliers use
the add and shift algorithm. Each multiplier has two
16 bit registers, a 16 bit adder and 16 AND gates. Each
multiplier forms one of the four products needed to
generate a complex product. The outputs of the multipliers
are hardwired to the inputs of the adder and subtractor.
The output of the subtractor forms the real part and the
output of the adder forms the imaginary part of the
complex product.

3.5 The Adder and Subtractor

The inputs to the adder and subtractor are from
the multipliers and the data registers. Either can be
selected with the help of gating. The same adder and
subtractor are used in complex multiplication as well as
in the addition and subtraction involved in the butterfly
computations.

The input numbers are two's complemented before
being added or subtracted. The results are converted
back to sign magnitude form before storing back in data
registers. Thus there are four two's complementers at
the input and two at the output. These are standard
serial two's complementers. The carry flip-flops
in the adder and subtractor are used to store the sign
bit of the result, which occurs as the most significant
bit of the result. These sign bits control the operation
of the output two's complementers.
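
The conversion between sign magnitude and two's complement around the adder and subtractor can be sketched in Python as follows. The sketch works on whole 16 bit words rather than bit serially, the names are hypothetical, and numbers are represented as (sign, magnitude) pairs for illustration.

```python
WORD_BITS = 16
MASK = (1 << WORD_BITS) - 1


def to_twos_complement(sign, magnitude):
    """Input complementer: sign magnitude pair to a 16 bit word."""
    return (-magnitude if sign else magnitude) & MASK


def from_twos_complement(word):
    """Output complementer: the most significant bit is the sign."""
    sign = (word >> (WORD_BITS - 1)) & 1
    magnitude = ((~word + 1) & MASK) if sign else word
    return sign, magnitude


def add_sign_magnitude(x, y, subtract=False):
    """Add or subtract two (sign, magnitude) numbers via two's complement."""
    a = to_twos_complement(*x)
    b = to_twos_complement(*y)
    if subtract:
        b = (-b) & MASK
    return from_twos_complement((a + b) & MASK)


def combine_products(ag, bh, ah, bg):
    """Complex product (a + jb)(g + jh) from its four real products.

    The subtractor forms the real part and the adder the imaginary part.
    """
    real = add_sign_magnitude(ag, bh, subtract=True)
    imag = add_sign_magnitude(ah, bg)
    return real, imag
```

As a check with small integers, (3 - 2j)(4 + 5j) = 22 + 7j, and `combine_products((0, 12), (1, 10), (0, 15), (1, 8))` returns the pairs (0, 22) and (0, 7).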

3.6 The Data and Coefficient Registers

The AU has a number of registers to store data,
coefficients and intermediate results. There are two
data registers which hold the two complex numbers read
from the RAM. Each consists of two 16-bit registers.
These are parallel input, parallel output, serial input,
serial output registers. The ROM buffer register holds the
complex weight increment read from the ROM. This also
consists of two parallel input, serial output, 16-bit
registers. The weight or coefficient register, which
holds the current trigonometric weight, consists of
two serial input, serial output 16-bit registers. These
must, in addition, have parallel input capability to
load $W^0$, or 1 - j0, at the start of each iteration. All of
the above registers are fabricated using 7495 universal
register ICs.

In addition, the AU has two 16-bit serial in,
serial out buffer registers to store the result of the
adder and subtractor. This temporary storage is necessary
because the sign bit of the result appears at the
end and the result must be two's complemented before
being stored in data registers.

Some flip-flop flags are used to store the sign
bits of the numbers being processed. The sign bits of
the multiplicand and multiplier are exclusive-ORed
to generate the sign bit of the product. A small amount of
AND-OR logic is used to select the numbers to be multiplied
and/or the numbers to be added and subtracted.

3.7 'One of Four' Selector

Since the RAM is a single word read/write memory,
the computed results have to be stored one by one. There
are four real numbers to be stored, hence the main control
unit must select them one by one and write them in the
memory. A 'one of four' selector is suitable for this
purpose and can be easily fabricated using 2 input and
4 input NAND gates. While reading from the memory, the 'one
of four' selector is not required since all the register
inputs can be connected to the memory output simultaneously.
Data is selectively written in one of the
registers by giving a load pulse to that particular
register.

3.8 AU Control

AU control generates the clock pulses
and gating commands necessary for the operation of the
AU. Its working is governed by a control algorithm
which has twelve steps, sixteen clock pulses per step.
Thus execution of the control algorithm takes a maximum
of 192 clock pulses. The control algorithm is given
below.
Notations :

Data1 and Data2 are the two words fetched from the
memory.

Data1 = c + jd, Data2 = e + jf

The ROM register stores the increment a+jb. The weight
register stores the previous weight g+jh. The multipliers
are M1, M2, M3 and M4. 'I' is the interchange command.
B1, B2 are buffer registers. C(x) denotes contents
of register x and ← denotes loading operation.
Addition and subtraction are in two's complement.

Begin

Step 1 :   C(a) loaded in M1 and M4
           C(b) loaded in M2 and M3

Step 2 :   C(g) multiplies C(M1) and C(M3)
           C(h) multiplies C(M2) and C(M4)
           C(g) and C(h) rotated if I=0
           C(g) and C(h) interchanged if I=1

Step 3 :   C(B1) ← C(a).C(g) - C(b).C(h)
           C(B2) ← C(a).C(h) + C(b).C(g)

Step 4 :   C(g) ← C(B1)
           C(h) ← C(B2)

Step 5 :   C(e) loaded in M1 and M4
           C(f) loaded in M2 and M3

Step 6 :   C(g) multiplies C(M1) and C(M3)
           C(h) multiplies C(M2) and C(M4)
           C(g) and C(h) rotated

Step 7 :   C(B1) ← C(e).C(g) - C(f).C(h)
           C(B2) ← C(e).C(h) + C(f).C(g)

Step 8 :   C(e) ← C(B1)
           C(f) ← C(B2)

Step 9 :   C(B2) ← C(c) + C(e)
           C(B1) ← C(c) - C(e)

Step 10 :  C(c) ← C(B2)
           C(e) ← C(B1)

Step 11 :  C(B2) ← C(d) + C(f)
           C(B1) ← C(d) - C(f)

Step 12 :  C(d) ← C(B2)
           C(f) ← C(B1)

End

The pictorial representation of the control
algorithm is given in Figure 3.2.

As is observed in the control algorithm, each
complex multiplication takes place in four steps.
The multiplicand is loaded in the MP register of the
multiplier in the first step. Multiplication by
summation of partial products takes place in the second
step. The addition and subtraction of real products to
form the real and imaginary parts of the complex product is
performed in the third step. In the fourth step, the
complex number generated is loaded back in the coefficient
register or the data register, as the case may be.
The first four steps in the control algorithm are
necessary only for the generation of the new weight.
This suggests two alternatives for the synthesis of
AU control.

3.9 Alternatives for AU Control

The first alternative is to make two separate
controls. The first control operates when the new weight
has to be generated and all the twelve steps must be
executed. The second control operates when only the
latter eight steps are to be executed. This approach
reduces the average execution time of the AU but is
quite costly in terms of hardware.

A better way is to make a control which always 
executes twelve steps. The clock is suppressed when 
the first four steps are to be skipped and starts only 
at the beginning of the fifth step. This permits a simple 
counter-decoder type of synthesis for the AU control.

3.10 Counter-decoder Synthesis

The cycle counter is a 12x16 state, eight bit
synchronous counter. A trigger pulse sets a latch
which allows the clock to go through an AND gate. The
same clock advances the counter. After 192 pulses have
passed through, the latch is reset and the clock stops.

A decoder detects the state of the 'R' or 'I' commands
(details in Chapter 4) and allows either 128 or 192 pulses
to pass through. Another decoder generates gating
commands according to the state of the counter. Yet another
decoder generates clock pulses for the various sequential
circuits by opening or closing AND gates.
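
The pulse counts quoted above follow directly from twelve
steps of sixteen pulses each. The sketch below is a behavioural
illustration only; the function name au_clock_pulses and the
idea of presetting the counter to the start of step 5 when the
first four steps are skipped are assumptions, not a description
of the actual circuit.

```python
STEPS = 12
PULSES_PER_STEP = 16


def au_clock_pulses(weight_update):
    """Return the (step, pulse) pairs the AU sees after one trigger.

    weight_update is True when an 'R' or 'I' command is present,
    so that all twelve steps run; otherwise only steps 5 to 12 run.
    """
    first_step = 1 if weight_update else 5
    state = (first_step - 1) * PULSES_PER_STEP  # assumed counter preset
    latch = True                                # set by the trigger pulse
    trace = []
    while latch:
        step = state // PULSES_PER_STEP + 1
        trace.append((step, state % PULSES_PER_STEP))
        state += 1                              # the gated clock advances the counter
        if state == STEPS * PULSES_PER_STEP:
            latch = False                       # latch reset, clock stops
    return trace
```

Calling au_clock_pulses(True) yields 192 entries and
au_clock_pulses(False) yields 128, the first of which belongs
to step 5.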

In addition to AU control, the main control also
has to give various commands to the AU. The details are
given in the next chapter.

Chapter 4
MAIN CONTROL UNIT

To execute the FFT, a particular indexing
pattern is required which is different for each of the
$\log_2 N$ iterations. The logic for the address pair
generation has already been described in Section 1.6.
Since the pattern is repetitive and simple, it can be
easily hardwired. For the same reason, modularity
can be easily introduced in the control unit so that the
same unit works for any number of points in the
computation of radix-2 FFT.

The main function of the control unit is to 
keep track of the iteration number and accordingly 
generate a pair of addresses. If decimation in 
frequency is used, it is observed that the first 
iteration involves one group of butterflies, second 
involves two groups, third involves four groups and 
so on. The 8 point FFT example illustrates this clearly. 

4.1 Iteration Register

There are two ways of keeping track of the
iteration currently in progress. The first one is
to have a straight binary counter which has $\log_2 N$
states and hence $\log_2(\log_2 N)$ bits. Since the
iteration number controls other sequential circuits
to produce addresses, the output of the binary counter
has to go through a substantial amount of decoding. The
second alternative is much more elegant and easier to
implement. Instead of a straight binary counter, a
ring counter or shift register is needed. Initially
all the bits of this register, which will be called the
'Iteration Register', abbreviated IR, from now on, are
cleared to zero. When the computation starts, the
first or leftmost bit is made 1. Whenever an
iteration ends, this single '1' is shifted one place
to the right. The bits to its left are cleared. Thus only the
second bit is '1' in the second iteration, and the
third bit is '1' in the third iteration. In general,
only the kth bit is '1' in the kth iteration; the rest of
the bits are zero. When this single '1' is shifted out of the
last or extreme right bit of the IR, it signifies
that all the $\log_2 N$ iterations are over and the process
stops. The $\log_2 N$ outputs of the IR control the address
counter in a very simple way which we shall see in the
next section.
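
The behaviour of the Iteration Register can be sketched in a
few lines. The function name and the list-of-bits
representation (index 0 is the leftmost bit) are assumptions
made only for this illustration.

```python
def iteration_register_states(num_bits):
    """Return the successive IR contents for a transform with num_bits iterations."""
    ir = [0] * num_bits      # all bits cleared initially
    ir[0] = 1                # leftmost bit set when the computation starts
    states = []
    while any(ir):           # the process stops once the '1' is shifted out
        states.append(ir[:])
        ir = [0] + ir[:-1]   # shift right by one place, clearing the left bit
    return states
```

For an 8 point transform, iteration_register_states(3) gives
[1, 0, 0], [0, 1, 0] and [0, 0, 1], one state per iteration.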

4.2 Address Counter 

As described in Section 1.6, the second address
in a pair in the kth iteration is obtained by inverting
the kth bit of the first address in the pair. Since
we already have an iteration register having only the kth
bit equal to '1' in the kth iteration, the individual
bits of the second address can be obtained simply by
means of a set of $\log_2 N$ 2-input OR gates, having the
corresponding bits of the IR and the address counter
as their inputs. Thus the address counter has to
generate only the first address in a pair.

As we observed in Section 1.6, the kth bit of
the first address in the kth iteration is always zero.
The first addresses are in a natural binary sequence
if this zero bit is suppressed. This suggests that
the address counter can be synthesized in a very
simple way. The counter has $\log_2 N$ bits and hence the
same number of flip-flops. Instead of connecting the
output of each flip-flop only to the clock input of the
next flip-flop as in a binary ripple counter, it is also
connected to the clock input of the flip-flop after the
next one in the line through an AND gate, as shown in
Figure 4.1. The other input of the AND gate is the
corresponding bit of the IR. For example, the input
of the AND gate corresponding to the kth most
significant bit flip-flop is the kth most significant
bit of the IR. In addition the inverse of the
corresponding IR bit is connected to the clear terminal
of the corresponding flip-flop. Thus if the kth bit
of the IR is '1' then the kth flip-flop is cleared for
the complete duration of the kth iteration, thus making
the kth bit of the first address permanently zero. It
also causes the corresponding AND gate to connect the
output of the (k-1)th flip-flop to the clock input of the
(k+1)th flip-flop, thus skipping the kth flip-flop entirely.
This counter counts in the desired sequence and generates
the first address in a pair. The second address is
generated combinatorially and hence both addresses in a
pair are available simultaneously.
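
The address sequence produced by this arrangement can be
reproduced in software. The following is a minimal sketch
under stated assumptions: the function name address_pairs is
hypothetical, the bit skipping of the ripple counter is modelled
by inserting a zero into a plain binary count, and the OR gates
are modelled by a bitwise OR with the one-hot IR value.

```python
def address_pairs(n_bits, k):
    """Return the (first, second) address pairs of the kth iteration.

    n_bits is log2(N); k runs from 1 (first iteration) to n_bits.
    """
    ir = 1 << (n_bits - k)        # one-hot IR: kth most significant bit set
    low_mask = ir - 1             # bits below the skipped position
    pairs = []
    for count in range(1 << (n_bits - 1)):   # N/2 computations
        # Natural binary count with a zero inserted at the kth bit.
        first = ((count & ~low_mask) << 1) | (count & low_mask)
        second = first | ir       # OR gates set the kth bit
        pairs.append((first, second))
    return pairs
```

For the 8 point case, address_pairs(3, 1) gives (0, 4), (1, 5),
(2, 6), (3, 7), and address_pairs(3, 2) gives (0, 2), (1, 3),
(4, 6), (5, 7).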

4.3 One of Two Selector

Since the RAM is only single word read/write,
each address has to fetch the real part and then the
imaginary part. Thus a one of two selector is required
for the two addresses. The first address is connected
to the memory address input, and the real part is read and
loaded in the corresponding register of the AU. Then the
imaginary part is read and loaded in the second register.
Then the second address is connected and the two accesses
and loadings are done in a similar manner. Thus a total of
four accesses per computation are required. After the
AU has signalled end of computation, the real and
imaginary parts of both data words are stored in the
memory one by one in a similar way using the one of two
selector.

4.4 Computation Counter

A computation counter with N/2 states, or $\log_2 N - 1$
bits, is required to keep track of the computation currently
being executed. This is a simple ripple carry binary
counter. Overflow of the computation counter signals
the end of an iteration and causes the '1' in the
iteration register to be shifted one place to the
right. The output of the computation counter is also
used as input to some decoders to generate control
commands as described in the next section.

4.5 Generation of 'R' and 'I' Commands

The control unit has to signal to the AU when
an extra complex multiplication is to be performed to
generate a new weight. This also involves sending an
increment pulse to the ROM counter and a load pulse to
the ROM register. We will designate this ROM increment
and load command as the 'R' command. The other command is
the 'I' or 'interchange' command, which signals the AU to
interchange the real and imaginary parts of the weight
register and invert the sign bit of the real part. When
both of these commands are absent, the AU performs
only the butterfly calculation.

To generate the 'R' and 'I' commands, the
necessary logic can be determined by writing an example
FFT flow chart for a small number of points, say 32,
making tables of the weights required in the different
iterations, calculating the increments and when they
are required, and detecting the logical patterns. These
can be extrapolated for the higher point FFT. The final
result for the 2K point FFT is given in Appendix A1. It
turns out that particular bit patterns of the computation counter
must be detected which are different for different iterations.
This can be easily done by simple decoders.

4.6 Other Control Commands

Apart from the above mentioned control signals, the
control unit has to generate some other signals like Read
Memory, Write Memory, load the registers, operate the one of
four selector in the AU when storing back the processed data
in the memory and so on. After the data has been loaded in the
registers, the control unit sends a trigger pulse to the AU
control which starts the execution. The control unit then
becomes idle and waits till the AU control signals end of
computation. Then it starts storing the data in the memory,
incrementing the address counter and the computation counter, and
detecting the end of iteration.
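
The order of events described in Sections 4.3 to 4.6 can be
summarised as a simple loop. This is only an outline: the
function name control_sequence and the action labels are
hypothetical, and no actual memory or AU is modelled.

```python
def control_sequence(n_points):
    """List the main control actions for a transform of n_points samples.

    n_points is assumed to be a power of two.
    """
    n_bits = n_points.bit_length() - 1          # log2(N) iterations
    actions = []
    for k in range(1, n_bits + 1):              # position of the '1' in the IR
        for comp in range(n_points // 2):       # computation counter
            # Four memory reads: real and imaginary part of each address.
            for which in ("first", "second"):
                for part in ("real", "imag"):
                    actions.append(("read", k, comp, which, part))
            actions.append(("trigger AU", k, comp))
            actions.append(("wait for AU complete", k, comp))
            # Four memory writes of the processed data.
            for which in ("first", "second"):
                for part in ("real", "imag"):
                    actions.append(("write", k, comp, which, part))
        # Overflow of the computation counter ends the iteration.
        actions.append(("shift IR", k))
    return actions
```

For 8 points the list contains 12 AU triggers, 48 reads,
48 writes and 3 shifts of the IR.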

4.7 Relative Speeds of AU and Main Control Unit

The execution time of the main control unit is 
much smaller compared to the AU execution time. However, 
there is a maximum limit on the clock frequency at which 
the control can operate. This is because of the 1 µsec
access time of the memory. The speed at which the AU can
operate is limited only by the printed circuit design,
wiring delays and the speed of the ICs. Hence the clock of
the AU can be much faster than the clock of the main control.
This will reduce the mismatch between the speeds of the
AU and the control.

4.8 Modularity

Since the signal flowgraph of the FFT is highly
symmetrical and repetitive, the control unit can be made
to act for various numbers of points in radix 2 FFT. It can
be easily made modular by a simple switch which causes
the initial '1' to be loaded in different bits of the
iteration register. For example, loading the '1' in the
11th bit from the rightmost bit will make the control unit
operate for 2K points. In general, loading the
initial '1' in the kth bit of the iteration register from the
rightmost bit converts the control to work for $2^k$
points. Obviously the hardware must be initially
designed for the highest number of samples to be processed.
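
The effect of the switch can be illustrated with a short
sketch. The function names and the integer representation of
the IR (bit 0 is the rightmost bit) are assumptions for
illustration only.

```python
def initial_ir(ir_width, n_points):
    """Return the initial IR value for an n_points transform.

    The '1' is placed in the kth bit from the right, where n_points = 2**k.
    """
    k = n_points.bit_length() - 1
    if n_points != 1 << k or not 1 <= k <= ir_width:
        raise ValueError("n_points must be a power of two that fits the IR")
    return 1 << (k - 1)


def iterations_until_done(ir):
    """Count the right shifts needed before the '1' leaves the register."""
    count = 0
    while ir:
        ir >>= 1
        count += 1
    return count
```

With an 11 bit register, initial_ir(11, 2048) sets the leftmost
bit and the control runs 11 iterations, while initial_ir(11, 8)
sets the third bit from the right and only 3 iterations follow.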

Appendix A1

GENERATION OF 'R' AND 'I' COMMANDS

The 'I' command causes the contents of the 'g' and 'h'
registers to be interchanged. This is not generated in the
first iteration since only one weight is required. In
the second iteration it is generated once, after half the
computations (N/4) are over. This corresponds to the most
significant bit of the computation counter being 1 and all
the remaining less significant bits (LSB's) being 0. In
the third iteration, the 'I' command is generated twice, once
when N/8 computations are over, and then again after 3N/8
computations are over. In this case the second most
significant bit of the computation counter is 1 and the LSB's are
0. The most significant bit may be either 0 or 1, i.e.,
in a 'don't care' condition. In general, in the kth
iteration the 'I' command is generated $2^k/4$ times, the first
'I' command being given after $N/2^k$ computations are over.
This corresponds to the state in which the (k-1)th most
significant bit of the computation counter is 1 and the
LSB's are 0. All the higher significant bits may be in a
'don't care' condition. A simple decoder, with the
Iteration Register outputs and the computation counter
outputs as its inputs, can generate this command.
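
The decoding rule for the 'I' command can be written as a bit
mask test. The function below is a sketch of that rule, not of
the actual gate network; its name and arguments are
hypothetical.

```python
def i_command(n_points, k, count):
    """Return True if 'I' is generated in iteration k at counter state count.

    n_points is N (a power of two); count runs from 0 to N/2 - 1.
    """
    if k < 2:
        return False                      # not generated in the first iteration
    m = n_points.bit_length() - 2         # counter width, log2(N) - 1
    p = m + 1 - k                         # position of the (k-1)th MSB
    low_bits = (1 << (p + 1)) - 1         # that bit and all LSB's below it
    # The selected bit must be 1, the LSB's 0, higher bits are don't care.
    return (count & low_bits) == (1 << p)
```

For N = 32 the command appears at count 8 in the second
iteration and at counts 4 and 12 in the third, i.e. after N/4,
and after N/8 and 3N/8 computations.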

The 'R' command increments the ROM counter and loads
the new incremental coefficient in the ROM buffer register.
This command is not generated in the first two iterations
since no new weight is required. It is generated once in
the third iteration, thrice in the fourth iteration and
seven times in the fifth iteration. In general, in the kth
iteration, it is generated ($2^{k-2} - 1$) times. The
occasion when it is generated for the first time corresponds
to the state of the computation counter in which the
(k-2)th most significant bit is 1 and the LSB's are 0. The
more significant bits may be in a 'don't care' condition.
Thereafter the 'R' command is generated after every
$N/2^{k-1}$ computations. This too can be synthesized using
a decoder similar to the one for 'I' command generation.
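
The rule for the 'R' command reduces to detecting non-zero
multiples of $N/2^{k-1}$ in the computation counter. The
following sketch, with a hypothetical function name, states
that rule in software form.

```python
def r_command(n_points, k, count):
    """Return True if 'R' is generated in iteration k at counter state count.

    n_points is N (a power of two); count runs from 0 to N/2 - 1.
    """
    if k < 3:
        return False                  # not generated in the first two iterations
    period = n_points >> (k - 1)      # N / 2**(k-1) computations between commands
    return count != 0 and count % period == 0
```

For N = 32 this gives one command in the third iteration
(count 8), three in the fourth (counts 4, 8, 12) and seven in
the fifth, in agreement with the counts stated above.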

Appendix A2

LIST OF CONTROL COMMANDS

(1) Master clock.
(2) Trigger pulse (AU enable).
(3) AU complete (to main control from AU).
(4) 'R' command.
(5) 'I' command.
(6) ROM counter increment pulse.
(7) ROM coefficient load pulse.
(8) ROM buffer register mode.
(9) Mode 'c'.
(10) Mode 'd'.
(11) Mode 'e'.
(12) Mode 'f'.
(13) Load 'c'.
(14) Load 'd'.
(15) Load 'e'.
(16) Load 'f'.
(17) Select 'c'.
(18) Select 'd'.
(19) Select 'e'.
(20) Select 'f'.
(21) ROM counter clear pulse.
(22) Initialize 'g, h' (at the start of each iteration).
(23) Toggle pulse to sign 'g' flag (with the 'I' command).

Master clock is fed continuously to the AU.
Trigger pulse enables the AU. After execution, the AU control
sends the 'AU complete' signal to the main control. The 'R' and
'I' commands have been explained previously. ROM counter
increment pulse increments the ROM counter. ROM coefficient
load pulse loads the new incremental coefficient in the
ROM buffer register. Mode commands are used while loading
the data from the memory into the data registers. Load
pulses write the data in the registers selectively. Select
commands select one of the four real numbers to be stored
back in the memory one by one. ROM counter clear pulse
and 'Initialize g, h' pulse are given at the start of
every iteration. Simultaneously with the 'I' command, a toggle
pulse is given to the sign 'g' flag.

Appendix A3
FABRICATION


The complete arithmetic unit is housed in a single
cabinet containing two racks. Each rack has the capacity
to hold sixteen printed circuit cards. The upper rack
contains data registers, diode matrices constituting the
ROM, ROM counter, ROM buffer register and 'one of four'
selector. The lower rack contains the processing hardware,
i.e. multipliers, adder, subtracter and associated gating
and flags. It also contains the cards constituting the
AU control. The total number of cards is 30, including
connector cards which are used to bring connections from
back side to front side and vice versa.

There are only two panel switches, one for power
and the other for clearing various sequential circuits at
the start of the experiment. Two parallel data buses,
each with 15 lines, connect the AU to the main memory.
One data bus is for input and one for output. One 23 line
bus carries the various control commands to and from the main
control unit.

The detailed circuit diagrams are given on the
following pages. Circuit numbers A3.1a and A3.1b are on
the same card. The rest of the circuits correspond to a single
card each.

Figure A3.2